Feat/pack mind #2298

Open
sharkqwy wants to merge 35 commits into dimensionalOS:main from sharkqwy:feat/pack-mind

@sharkqwy sharkqwy commented May 28, 2026

PACK MIND — shared operational memory for a robot pack (experimental)

One brain, many bodies, one memory that outlives any single dog.

Multi-robot search usually means merging maps — brittle, months of SLAM. PACK MIND
skips that: each dog keeps its own map and the pack shares only meaning — which
zones are searched and what was found, by name, never coordinates. So robots
running independent SLAM frames coordinate through one lightweight ledger. One dog
sees the target → the whole pack knows instantly. One dog drops → its discoveries
persist and a teammate inherits its unfinished ground. Share a mind, not a map.

What's in the PR (all under dimos/experimental/pack_mind/)

Sim (the A/B proof, pure numpy — no GPU/ROS/sim deps):

  • explore_sim.py — fog-of-war frontier exploration with shared-vs-private discovered
    maps + frontier de-confliction + object search.
  • view_explore_rerun.py — DimOS Viewer (Rerun) view; shared-vs-independent A/B + kill-a-dog.
  • server.py + static/explore.html — web 3D side-by-side fog-of-war race (shared vs
    independent, kill/reset).
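
For readers new to frontier exploration, the core of the shared-map idea can be sketched in a few lines of numpy. This is an illustration only, not the code in explore_sim.py: the cell encoding (`UNKNOWN`, `FREE`, `WALL`) and the helper names `frontier_cells` and `pick_frontier` are assumptions made for the example.

```python
import numpy as np

UNKNOWN, FREE, WALL = -1, 0, 1  # hypothetical cell encoding


def frontier_cells(known: np.ndarray) -> np.ndarray:
    """Mask of discovered free cells that touch an unknown cell (4-neighbourhood)."""
    unknown = known == UNKNOWN
    padded = np.pad(unknown, 1, constant_values=False)  # outside the grid is not "unknown"
    near_unknown = (
        padded[:-2, 1:-1]    # neighbour above
        | padded[2:, 1:-1]   # neighbour below
        | padded[1:-1, :-2]  # neighbour to the left
        | padded[1:-1, 2:]   # neighbour to the right
    )
    return (known == FREE) & near_unknown


def pick_frontier(known, position, taken):
    """Nearest frontier cell (Manhattan distance) that no packmate already targets."""
    cells = [(int(r), int(c)) for r, c in np.argwhere(frontier_cells(known))]
    candidates = [cell for cell in cells if cell not in taken]
    if not candidates:
        return None
    return min(candidates, key=lambda cell: abs(cell[0] - position[0]) + abs(cell[1] - position[1]))
```

In the shared condition every dog would read and write one `known` grid and one `taken` set, so a frontier claimed by one dog is skipped by the others. In the independent condition each dog would hold its own copies.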

Live pack (one dog per laptop):

  • pack_coordinator.py (+ _server.py, pack_dashboard.html) — zone ledger (provably
    no double-assign), finding blackboard with stop-on-find, inheritance
    (release_dog), HTTP API + projector dashboard.
  • pack_search_skills.py — agent tools (start_search/next_zone/report_*/where_is).
  • red_detector.py — GPU-free "red object" find (HSV-free RGB ratio test) — fast,
    deterministic where a VLM is too heavy.
  • velocity_teleop.py + demo_drive.py — direct cmd_vel teleop (bypasses the
    planner's flaky is_goal_reached RPC) + keyboard driver with auto-detect-on-stop.
  • live.py — unitree-go2-pack blueprint, CPU/macOS-deployable (disables CUDA-only
    EdgeTAM modules).
  • mock_dog.py, demo_pack_live.py, demo_pack_scene.py, prefetch_live_models.py.
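
The three coordinator guarantees listed above (no double-assign, stop-on-find, inheritance) can be illustrated with a minimal in-memory ledger. This is a sketch under assumptions, not the code in pack_coordinator.py: the class name `ZoneLedger`, the method signatures, and the status strings are hypothetical. Only the names report_finding and release_dog come from the PR description.

```python
import threading


class ZoneLedger:
    """Hypothetical ledger keyed by zone NAME; it never stores coordinates."""

    def __init__(self, zones):
        self._lock = threading.Lock()
        self._zones = {name: ("open", None) for name in zones}  # name -> (status, dog)
        self.finding = None  # (dog, object, zone) once the target is seen

    def assign(self, dog):
        """Atomically claim the first open zone; None if nothing is left or the pack stopped."""
        with self._lock:
            if self.finding is not None:
                return None  # stop-on-find: no new assignments
            for name, (status, _) in self._zones.items():
                if status == "open":
                    self._zones[name] = ("claimed", dog)
                    return name
            return None

    def report_cleared(self, dog, zone):
        """Mark a zone searched, but only if this dog holds the claim."""
        with self._lock:
            if self._zones.get(zone) == ("claimed", dog):
                self._zones[zone] = ("cleared", dog)

    def report_finding(self, dog, obj, zone=""):
        """Record a finding; a blank zone falls back to the zone the dog has claimed."""
        with self._lock:
            if not zone:
                claimed = [n for n, s in self._zones.items() if s == ("claimed", dog)]
                zone = claimed[0] if claimed else ""
            self.finding = (dog, obj, zone)

    def release_dog(self, dog):
        """Reopen a downed dog's unfinished zones; cleared zones and findings persist."""
        with self._lock:
            freed = [n for n, s in self._zones.items() if s == ("claimed", dog)]
            for name in freed:
                self._zones[name] = ("open", None)
            return freed
```

Because every claim happens under one lock, two dogs calling `assign` at the same moment cannot receive the same zone name, which is the property the description calls "no double-assign".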

Tests: 58 passing (coordinator, HTTP server, search runner, red detector, sim engine).
Registry: taught the blueprint scanner to recognize .disabled_modules() so
disabled-module blueprints register.

Demo

  • 🎥 90-second video: https://youtu.be/Ggv3aS5rDCM/ — live Alpha→Bravo handoff (one find, the pack knows)
    • sim A/B coverage + the resilience beat (lose a dog, knowledge persists).
  • 🖼 Slides: https://pack-mind.pages.dev/ — the idea, the handoff + inheritance beats, the A/B proof, and the fleet-memory-layer business case.

Honest scope / design notes

  • Experimental. Sim shows cell-level shared map; live is zone-level (two real
    dogs = two SLAM frames — merging them live is the brittle thing we deliberately don't do).
  • Live movement is teleop-assisted, detection is a color filter (not a VLM) —
    chosen for reliability on a no-GPU ground station. Full autonomy is shown in sim.

TBD / follow-ups (not in this PR)

  • Multi-dog / multi-laptop validation. The coordinator is N-agnostic (HTTP + zone
    names + env-configured identity), but only validated at 2 dogs. Field work needed:
    cross-laptop HTTP reachability (firewall/subnet), STA-mode dog connections,
    zone-name consistency across frames (the physical meaning of "no overlap"),
    LCM host-isolation, and a 3-dog test (no-overlap + one-of-three-offline inheritance).
    Plan: validate cross-machine with mock_dogs first, then swap in real dogs.
  • macOS core enablers (SHM video-bridge subscription, moondream MPS device) — kept
    out of this PR; they overlap with the Zenoh transport work in #2106 ("Default to
    Zenoh transport on macOS and document replay workflow") and should land there or
    separately.
  • Coordinator replication — currently a single point; a real fleet wants
    gossip/replication.

Test plan

bin/pytest-fast dimos/experimental/pack_mind/          # 58 unit/integration tests
# hardware-free demo + multi-actor logic:
uv run python -m dimos.experimental.pack_mind.demo_pack_live --pace 2   # dashboard at :8090
uv run python -m dimos.experimental.pack_mind.server                   # web A/B at :8000
# live (per README "Live demo" section): dimos run unitree-go2-pack

sharkqwy added 30 commits May 26, 2026 10:46
Shared semantic memory for a Go2 team. Conductor (roster + append-only
blackboard + deterministic mission state machine + movement lock) talks to
each dog over MCP JSON-RPC; cross-dog handoff is by zone name only, no
coordinates. Browser dashboard renders the causal chain. --mock runs the
full Alpha->Bravo story with no hardware.

dimos/experimental/pack_mind/
Load big_office_simple_occupancy.png into a fog-of-war search arena:
remap PNG values (15=free/105=wall/0=exterior), downsample, morphological
close to bridge speckle, keep largest connected floor. spread_starts()
picks deterministic spread deployment. build_explore_building() + server
PACK_MIND_MAP=building switch. On the 27x37m office floor the shared-vs-
independent gap widens (shared clears 100%, independent stalls ~39%).

Fixed Fog(60,140) blacked out the larger building floor (camera ~274u from
target, beyond fog far=140). Scale fog near/far by span; read cell res from
state instead of assuming 0.1m so dogs land on the right cells.

When joined to the robot's own WiFi hotspot the gateway is always
192.168.12.1 and the peer expects the LocalAP handshake (empty SDP id).
The previous hardcoded LocalSTA method sends id="STA_localNetwork", which
some firmware rejects in AP mode, causing the WebRTC offer to hang with no
answer. Auto-select LocalAP for the AP gateway IP, keep LocalSTA otherwise.
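
The selection rule in this commit message can be written as a one-line check. The sketch below is an assumption-laden illustration: the function name `select_handshake` is hypothetical, and the returned strings are plain labels standing in for whatever constants the WebRTC connection code actually uses.

```python
AP_GATEWAY_IP = "192.168.12.1"  # gateway address on the robot's own hotspot, per the commit message


def select_handshake(robot_ip: str) -> str:
    """Return the handshake label to use for the given robot IP (hypothetical helper)."""
    # On the robot's own AP the peer expects the LocalAP handshake; anywhere else keep LocalSTA.
    return "LocalAP" if robot_ip.strip() == AP_GATEWAY_IP else "LocalSTA"
```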
The dimos.msgs packages declare __all__ = [], so the previous
getattr(module, "__all__", dir(module)) resolved to an empty list (present
but falsy) and injected nothing into the eval context — every expression
failed with "name 'Twist' is not defined". Walk same-named submodules
(geometry_msgs.Twist.Twist, etc.) when __all__ is empty so message classes
are available to the expression.
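
The pitfall is easy to reproduce with a throwaway module object. The sketch below is illustrative only: `fake_msgs` and `exported_names` are hypothetical, and the fallback here uses `dir()` for brevity, whereas the commit walks same-named submodules.

```python
import types


def exported_names(module):
    """Names to expose from a module, treating an empty __all__ like a missing one."""
    names = getattr(module, "__all__", None)
    if not names:  # covers both "attribute missing" and "present but empty"
        names = [n for n in dir(module) if not n.startswith("_")]
    return list(names)


# A stand-in package that declares __all__ = [] but still defines a class.
pkg = types.ModuleType("fake_msgs")
pkg.__all__ = []
pkg.Twist = type("Twist", (), {})

# The default passed to getattr is ignored because the attribute exists, so this is empty.
naive = getattr(pkg, "__all__", dir(pkg))
```

Here `naive` is an empty list, while `exported_names(pkg)` still finds `Twist`.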
Keyboard teleop needs a pygame window (x11/cocoa), unavailable over SSH or
headless. demo_drive_go2.py publishes Twist on /cmd_vel at a fixed rate so a
running relay blueprint's ControlCoordinator forwards motion to the robot
over WebRTC — no GUI required. Useful for live-hardware bring-up and demos.
- explore_sim: shared-map frontier de-confliction so the pack fans out instead of
  re-walking each other's ground; plant a search target at the farthest free cell,
  detect-on-sight, and converge-on-found
- view_explore_rerun: DimOS Viewer (Rerun) view of the maze/office search with
  shared-vs-independent A/B and kill-a-dog resilience
- README: document the viewer commands

Laptop-side brain that shares MEANING (which zones are searched, what was found)
over HTTP so two dogs on two laptops never re-search the same area, and the
mission survives a dog going offline. Zone NAMES only — never coordinates — so two
independent SLAM frames work together without map merging.

- pack_coordinator: zone ledger (provably no double-assign), report_finding
  (stop-on-find + claimed-zone fallback), release_dog (survivor inherits a downed
  dog's unfinished zones; findings persist)
- pack_coordinator_server: JSON/HTTP API + projector dashboard at / + fan-out prefs
- pack_dashboard.html: live view (zones, finding, offline/inheritance, causal chain)
- pack_search_skills: dog agent tools (start_search/next_zone/report_*/where_is)
- pack_search_runner: RobotDriver-protocol search loop + MockDriver
- live.py: unitree-go2-pack blueprint (one dog per laptop, env-configured) + prompt
- mock_dog + demo_pack_scene: hardware-free test/rehearsal of handoff + inheritance
- tests: 24 passing (coordinator, HTTP server, runner)
- demo_pack_live: one command starts the coordinator + dashboard and plays the v4
  inheritance climax with pauses, so the projector animates the full story (two
  dogs fan out with no overlap, a dog goes offline, the survivor inherits its
  unfinished zone and finds the object there)
- mock_dog: add --target-zone (deterministic find in one place, like reality),
  --reset (call start_search first so re-runs reset the ledger), and --dwell
  (slow per-zone pacing for the dashboard)

Render the fog floor via OccupancyGrid.to_rerun (matching the live SLAM map),
extrude explored walls into Boxes3D for real volume, and aim an orbital camera at
the arena centre so the scene auto-frames instead of landing edge-on.

The --3d mode rendered the occupancy floor + extruded walls but never wired
a working playback timeline in rerun 0.32 (recording showed no scrubbable
tick axis), so it was a dead end for the demo. Real 3D map visuals come from
LiDAR replay (`dimos run unitree-go2`), not the 2D coverage sim. Keep the
working 2D fog A/B (--independent for the shared-vs-private baseline); that is
the deliverable that proves the shared-memory thesis.
…dules)

EdgeTAM segmentation hard-requires a CUDA GPU and reaches the spatial stack two
ways, both fatal on a CPU/CoreML ground station (e.g. a Mac):
  - unitree_go2_spatial -> SecurityModule.__init__ -> EdgeTAMProcessor()  (deploy crash)
  - PersonFollowSkillContainer.follow_person -> EdgeTAMProcessor()         (mid-demo crash)
The pack demo uses neither, so disable both via .disabled_modules(). The blueprint
then deploys clean on CPU with no --disable flag and no call-time landmine.
Verified: active_blueprints = 19 modules, both excluded, PackSearchSkills + MCP present.

So a second laptop+dog can collaborate from a clean checkout:
- prefetch_live_models.py: one command caches the runtime models the live stack
  needs (moondream2 for look_out_for, faster-whisper-base for STT) so the demo runs
  fully offline (HF_HUB_OFFLINE=1) on a dog's internet-less AP / flaky venue WiFi
- README: 2-dog/2-laptop bring-up — per-laptop env + run, router/STA vs AP, dashboard,
  and the hard-won gotchas (no --robot-ip flag, EdgeTAM already disabled, offline mode,
  moondream first-call latency, bring-up gate order)

moondream VLM on a CPU ground station is slow (tens of seconds, async) and
fragile. The demo target is defined by COLOUR, so detect it with a colour filter:
- red_detector.py: RedObjectDetector watches the camera; @Skill look_for_red does
  a per-frame RGB ratio test (milliseconds, deterministic) and, on a hit, reports
  the finding to the coordinator (blank zone → coordinator fills the claimed zone)
- wired into unitree-go2-pack; PACK prompt now uses look_for_red instead of look_out_for
- 5 unit tests on the detection logic (hardware-free)
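
A per-frame RGB ratio test of the kind described here might look like the following. This is a sketch, not the code in red_detector.py: the function names and every threshold (`ratio`, `min_red`, `min_fraction`) are assumed values chosen for the example.

```python
import numpy as np


def red_fraction(rgb: np.ndarray, ratio: float = 1.8, min_red: int = 100) -> float:
    """Fraction of pixels whose red channel clearly dominates green and blue."""
    r = rgb[..., 0].astype(np.float32)
    g = rgb[..., 1].astype(np.float32)
    b = rgb[..., 2].astype(np.float32)
    # A pixel counts as red if it is bright enough and beats both other channels by `ratio`.
    red = (r >= min_red) & (r > ratio * g) & (r > ratio * b)
    return float(red.mean())


def looks_red(rgb: np.ndarray, min_fraction: float = 0.02) -> bool:
    """True when enough of the frame is red to call it a sighting."""
    return red_fraction(rgb) >= min_fraction
```

The test uses only comparisons and one mean over the frame, so it needs no colour-space conversion and gives the same answer for the same frame every time.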
sharkqwy added 2 commits May 28, 2026 08:20
…hang)

relative_move runs the A* planner then waits on ReplanningAStarPlanner/is_goal_reached
over LCM — flaky on macOS, hangs 120s even though the dog reached the goal. New
VelocityTeleop publishes Twist straight to MovementManager's tele_cmd_vel lane: the
dog moves instantly, no planner, no confirmation RPC. demo_drive now uses `drive`
(velocity bursts) instead of relative_move; W/A/S/D map to forward/turn m/s·rad/s.
- remove the old "backpack handoff" demo (conductor, dashboard, sim_harness,
  venue_go2, RUNBOOK, test_conductor) — superseded by the exploration A/B + live pack
- recognize .disabled_modules() in the blueprint AST scanner so unitree-go2-pack
  (which disables the CUDA-only EdgeTAM modules) registers; regenerate all_blueprints

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec6b00693b

Comment on lines +206 to +207
@rpc
def next_zone(self) -> str:

P1: Expose next_zone as a pack skill

In the live unitree-go2-pack flow, the system prompt tells the MCP agent to call next_zone and should_stop during every mission loop, but MCP only lists methods returned by Module.get_skills(), which requires the @skill marker rather than plain @rpc. With these methods left as RPC-only, tools/list will not include them, so after start_search the agent cannot obtain assignments or observe the pack stop condition and the advertised autonomous search loop stalls.


Comment on lines +110 to +114
requests.post(
    f"{self._url}/report_finding",
    json={"dog": self._dog, "object": "red object", "zone": ""},
    timeout=_REQUEST_TIMEOUT,
)

P2: Raise on failed finding reports

When a red object is detected but the coordinator responds with a non-2xx HTTP status (for example a bad PACK_COORDINATOR_URL pointing at a server that returns 404/500), requests.post does not raise, so this skill still tells the agent the sighting was “reported to the pack” even though the coordinator did not record it. That leaves the finder stopping while teammates continue searching; mirror the other coordinator calls by checking raise_for_status() before returning success.
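
One way to apply this suggestion is to treat any non-2xx response as a failed report. The sketch below is not the PR's code: `report_red_finding` and its return strings are hypothetical, and `FakeResponse`/`FakeHTTP` exist only so the example runs without a network. The `http` argument is assumed to be anything with a requests-style `post`, such as the `requests` module itself.

```python
class FakeResponse:
    """Offline stand-in that mimics the raise_for_status() behaviour of a requests response."""

    def __init__(self, status_code):
        self.status_code = status_code

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")


class FakeHTTP:
    """Offline stand-in for the requests module; always answers with one status code."""

    def __init__(self, status_code):
        self._status = status_code

    def post(self, url, json=None, timeout=None):
        return FakeResponse(self._status)


def report_red_finding(http, base_url, dog, timeout=3.0):
    """Tell the agent the truth about whether the coordinator recorded the sighting."""
    try:
        resp = http.post(
            f"{base_url}/report_finding",
            json={"dog": dog, "object": "red object", "zone": ""},
            timeout=timeout,
        )
        resp.raise_for_status()  # a 4xx/5xx answer becomes an exception instead of passing silently
    except Exception:  # with real requests, catching requests.RequestException would be narrower
        return "Saw a red object, but could not report it to the pack."
    return "Red object reported to the pack."
```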


greptile-apps Bot commented May 28, 2026

Greptile Summary

PACK MIND adds shared operational memory for a multi-robot pack: a zone-name ledger + target blackboard served over HTTP, so dogs coordinate search coverage without exchanging coordinates or merging SLAM maps. The PR includes a pure-numpy fog-of-war sim (A/B proof), a live Go2 blueprint, 58 tests, and small improvements to the blueprint scanner and topic send CLI.

  • Coordinator + search loop (pack_coordinator.py, pack_search_runner.py): thread-safe PackCoordinator guarantees no-overlap zone assignment and instant stop-on-find; the run_search loop drives both real and mock robots through the same interface.
  • Live blueprint (live.py): unitree-go2-pack uses the new .disabled_modules() chain to strip CUDA-only EdgeTAM/PersonFollow so the blueprint deploys cleanly on a CPU/macOS ground station.
  • Infrastructure tweaks (connection.py, topic.py, test_all_blueprints_generation.py): auto-select WebRTC handshake mode based on AP gateway IP, fix topic send class resolution for msgs packages, and teach the blueprint scanner to recognise .disabled_modules() chains.

Confidence Score: 4/5

Safe to merge as experimental code; two defects in the search runner and red detector deserve a fix before a reliability demo.

Two concrete defects in the changed code: in pack_search_runner.py, zones the robot fails to physically reach remain permanently claimed and are never searched or released — coverage is silently incomplete. In red_detector.py, HTTP-level errors from the coordinator are not detected (no raise_for_status()), so the agent is told the finding was broadcast to the pack when the server may have rejected or dropped the request. Both issues affect the core demo flow.

dimos/experimental/pack_mind/pack_search_runner.py and dimos/experimental/pack_mind/red_detector.py
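
A possible shape for the first fix is to hand an unreachable zone back to the ledger so a teammate can take it. The sketch below is a guess at the structure, not the code in pack_search_runner.py: `StubCoordinator`, `release_zone`, and the `goto`/`look` callables are all hypothetical, and the stop-on-find poll is omitted for brevity.

```python
class StubCoordinator:
    """Tiny in-memory stand-in for the real coordinator (hypothetical API)."""

    def __init__(self, zones):
        self.status = {z: "open" for z in zones}
        self.owner = {}
        self.unreachable = {}  # zone -> dogs that already failed to reach it

    def assign_zone(self, dog):
        for zone, state in self.status.items():
            if state == "open" and dog not in self.unreachable.get(zone, set()):
                self.status[zone] = "claimed"
                self.owner[zone] = dog
                return zone
        return None

    def release_zone(self, dog, zone):
        """Reopen a zone the dog could not reach and remember not to offer it to that dog again."""
        if self.status.get(zone) == "claimed" and self.owner.get(zone) == dog:
            self.status[zone] = "open"
            del self.owner[zone]
            self.unreachable.setdefault(zone, set()).add(dog)

    def report_cleared(self, dog, zone):
        self.status[zone] = "cleared"


def run_search(dog, coordinator, goto, look):
    """Search assigned zones; goto(zone) and look(zone) each return True or False."""
    while True:
        zone = coordinator.assign_zone(dog)
        if zone is None:
            return "done"
        if not goto(zone):
            coordinator.release_zone(dog, zone)  # do not leave the zone claimed forever
            continue
        if look(zone):
            return f"found in {zone}"
        coordinator.report_cleared(dog, zone)
```

Recording which dog failed to reach a zone keeps the loop from being handed the same zone again immediately, while leaving it open for a packmate with a different approach path.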

Important Files Changed

| Filename | Overview |
|---|---|
| dimos/experimental/pack_mind/pack_coordinator.py | Core zone-ledger + target blackboard; thread-safe with RLock; no-overlap and stop-on-find guarantees look solid, but report_cleared has no zone-ownership check (already flagged in prior review threads) |
| dimos/experimental/pack_mind/pack_coordinator_server.py | Thin HTTP wrapper over PackCoordinator; int() on Content-Length can raise ValueError on malformed headers (flagged in prior threads), otherwise request routing and JSON validation look correct |
| dimos/experimental/pack_mind/pack_search_runner.py | Deterministic search loop; unreachable zones (goto returns False) remain permanently claimed and are never searched, so coverage can be permanently incomplete without explicit release_dog |
| dimos/experimental/pack_mind/pack_search_skills.py | LLM-agent skill bridge; where_is ignores the object argument (already flagged); HTTP timeout and error handling otherwise consistent with _post/_get pattern |
| dimos/experimental/pack_mind/red_detector.py | GPU-free red-ratio detector; look_for_red does not call raise_for_status(), so HTTP-level coordinator errors are silently swallowed and the agent incorrectly believes the finding was broadcast |
| dimos/experimental/pack_mind/live.py | Go2 pack blueprint using disabled_modules() to strip CUDA-only EdgeTAM and PersonFollow for CPU/macOS deployment; clear and correct |
| dimos/robot/unitree/connection.py | Small change: auto-selects LocalAP vs LocalSTA WebRTC handshake based on whether IP matches the fixed AP gateway 192.168.12.1; logic is correct and well-commented |
| dimos/robot/test_all_blueprints_generation.py | Added disabled_modules to BLUEPRINT_METHODS so the scanner recognises .disabled_modules() chains as blueprints; straightforward and correct |
| dimos/robot/cli/topic.py | topic_send now uses pkgutil to walk submodules for msgs packages that declare an empty `__all__`, fixing class resolution for message types; correct improvement |

Sequence Diagram

sequenceDiagram
    participant DA as Dog Alpha (LLM agent)
    participant DB as Dog Bravo (LLM agent)
    participant PC as PackCoordinator (HTTP)

    Note over DA,PC: Mission start
    DA->>PC: POST /start_search
    PC-->>DA: Searching for red kit. 4 zones.

    Note over DA,DB: Zone assignment loop (no overlap)
    DA->>PC: "POST /assign_zone dog=alpha"
    PC-->>DA: "zone=north"
    DB->>PC: "POST /assign_zone dog=bravo"
    PC-->>DB: "zone=east"

    DA->>DA: navigate + look_for_red(north)
    DB->>DB: navigate + look_for_red(east)

    DA->>PC: "POST /report_cleared dog=alpha zone=north"
    PC-->>DA: north cleared.

    Note over DB: Red object found!
    DB->>PC: "POST /report_finding dog=bravo object=red object zone=east"
    PC-->>DB: finding recorded, pack stop

    DA->>PC: "GET /should_stop?dog=alpha"
    PC-->>DA: "stop=true"
    DA->>PC: GET /where_is
    PC-->>DA: "found=true zone=east by=bravo"

    Note over DA: Act on teammate memory
    DA->>DA: navigate_to(east)

Reviews (3): Last reviewed commit: "feat: PACK MIND live 2-dog runbook + pre..."

Comment on lines +163 to +182
@skill
def where_is(self, object: str) -> str:
    """Ask the pack's shared memory where an object was found.

    Use this to act on a TEAMMATE's discovery — you may never have seen the
    object yourself. Returns the zone a packmate reported it in, or that it
    hasn't been found yet.

    Args:
        object: What you're asking about, e.g. "red backpack".
    """
    data = self._get("/where_is", {})
    if data is None:
        return "Could not reach the pack memory."
    if not data.get("found"):
        return f"No packmate has found the {object} yet."
    return (
        f"{data.get('by')} found the {data.get('object')} in {data.get('zone')}. "
        f"I can take you there."
    )

P2 where_is ignores the object argument on the server query

The skill accepts object: str but self._get("/where_is", {}) passes an empty params dict — the queried object name is never sent to the server. The /where_is endpoint always returns the single current finding regardless of what object is being asked about. If the pack has found "red object" but the agent calls where_is("blue ball"), the coordinator still returns found=True and the response becomes "alpha found the red object in zone X. I can take you there." — the LLM receives a finding about a different object than it asked about. Consider either passing object as a query parameter and filtering server-side, or renaming the skill to current_finding() to reflect its true semantics.
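
The first option (filter on the server) could be as small as the helper below. It is a sketch with assumed semantics: the name `answer_where_is`, the dictionary shape, and the choice of a case-insensitive exact match are all hypothetical and would need to be agreed with the skill's callers.

```python
def answer_where_is(finding, query):
    """Answer a where_is query against the single current finding.

    `finding` is None or a dict with 'object', 'zone' and 'by' keys (assumed shape).
    """
    wanted = query.strip().lower()
    if finding is None or finding["object"].strip().lower() != wanted:
        # Nothing found yet, or the pack found something other than what was asked about.
        return {"found": False, "object": query}
    return {"found": True, **finding}
```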


def _read_json(self) -> dict[str, Any] | None:
    """Read a JSON object body. Returns None on malformed/non-object input."""
    length = int(self.headers.get("Content-Length", "0"))

P2 int() on a malformed Content-Length header raises unhandled ValueError

If a client sends a request with a non-integer Content-Length header (e.g. an empty string or "abc"), int(self.headers.get("Content-Length", "0")) raises ValueError. Since _read_json is called directly from do_POST with no surrounding try-except, the exception propagates up to ThreadingMixIn.process_request_thread(), which closes the connection without sending any response. The server stays up but the client receives a silent connection reset.

Suggested change
- length = int(self.headers.get("Content-Length", "0"))
+ try:
+     length = int(self.headers.get("Content-Length", "0"))
+ except ValueError:
+     length = 0

Comment on lines +114 to +120
def report_cleared(self, dog: str, zone: str) -> str:
    """Mark ``zone`` fully searched by ``dog`` (object not here)."""
    with self._lock:
        if zone in self._zones:
            self._zones[zone] = Zone(zone, "cleared", dog)
            self._log.append(f"{dog} cleared {zone}")
        return f"{zone} cleared."

P2 report_cleared has no zone-ownership check

Any dog can call report_cleared on a zone it does not own. If Dog A is currently searching zone "east" (claimed by Dog A) and Dog B mistakenly calls report_cleared("east"), the zone's state is overwritten to cleared with by=B. Dog A will still be searching a zone the ledger now marks as cleared by its teammate. In a coordinator that offers external zone-release via release_dog, this is also an easy-to-trigger consistency gap since the by field is used to identify which zones to reclaim on release.

Comment on lines +68 to +72
"""
duration = max(0.0, min(duration, _MAX_DURATION))
self._publish(forward, turn)
time.sleep(duration)
self._publish(0.0, 0.0) # stop

P2 drive() blocks the agent thread for up to 3 seconds

time.sleep(duration) runs on the LLM agent's calling thread, blocking it for the entire drive burst (up to _MAX_DURATION = 3.0 s). During this window the agent cannot process incoming messages, poll should_stop, or react to a packmate's finding. For short single-burst moves this is acceptable, but longer drives or rapid sequences will stall the agent loop.
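
A non-blocking variant could schedule the stop on a timer instead of sleeping on the caller's thread. This is a sketch, not the PR's VelocityTeleop: the class name `BurstDriver` and the `publish` callable are hypothetical, and only the 3.0 s cap mirrors the `_MAX_DURATION` value quoted above.

```python
import threading


class BurstDriver:
    """Publish a velocity burst and return at once; a timer sends the stop later."""

    def __init__(self, publish, max_duration=3.0):
        self._publish = publish          # callable(forward, turn), assumed thread-safe
        self._max = max_duration
        self._lock = threading.Lock()
        self._generation = 0             # identifies the most recent burst
        self._timer = None

    def drive(self, forward, turn, duration):
        duration = max(0.0, min(duration, self._max))
        with self._lock:
            self._generation += 1
            if self._timer is not None:
                self._timer.cancel()     # a newer burst replaces the pending stop
            self._publish(forward, turn)
            self._timer = threading.Timer(duration, self._stop, args=(self._generation,))
            self._timer.daemon = True
            self._timer.start()

    def _stop(self, generation):
        with self._lock:
            if generation == self._generation:  # ignore stops that belong to older bursts
                self._publish(0.0, 0.0)

    def wait(self):
        """Block until the pending stop has fired (useful in tests)."""
        timer = self._timer
        if timer is not None:
            timer.join()
```

With this shape the agent thread gets control back immediately after `drive`, so it can keep polling should_stop while the burst is still running.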


@sharkqwy sharkqwy requested a review from a team May 28, 2026 16:52
Comment on lines +114 to +120
def report_cleared(self, dog: str, zone: str) -> str:
    """Mark ``zone`` fully searched by ``dog`` (object not here)."""
    with self._lock:
        if zone in self._zones:
            self._zones[zone] = Zone(zone, "cleared", dog)
            self._log.append(f"{dog} cleared {zone}")
        return f"{zone} cleared."

P1 report_cleared silently accepts zones not owned by the reporting dog

Any dog can call report_cleared on a zone owned by another dog and the zone state will be overwritten. If Dog A holds "east" (claimed) and Dog B calls report_cleared("east"), the coordinator marks it cleared by B while A is still searching it. Additionally, release_dog relies on zone.by to reclaim zones from the released dog — a spurious clear from another dog prevents that reclaim, leaving a zone permanently cleared even though neither dog finished it.
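
An ownership guard along these lines would reject the spurious clear. The sketch is self-contained and makes assumptions: the `Zone` fields, the `claim` helper, and the rejection messages are hypothetical. Only the `f"{zone} cleared."` reply is taken from the quoted code.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Zone:
    name: str
    status: str               # "open", "claimed" or "cleared" (assumed states)
    by: Optional[str] = None  # dog that claimed or cleared the zone


class Ledger:
    def __init__(self, zones):
        self._lock = threading.RLock()
        self._zones = {z: Zone(z, "open") for z in zones}

    def claim(self, dog, zone):
        with self._lock:
            if self._zones[zone].status == "open":
                self._zones[zone] = Zone(zone, "claimed", dog)
                return True
            return False

    def report_cleared(self, dog, zone):
        with self._lock:
            current = self._zones.get(zone)
            if current is None:
                return f"Unknown zone {zone}."
            if current.status != "claimed" or current.by != dog:
                # Only the dog holding the claim may clear the zone.
                return f"{zone} is not claimed by {dog}; ignored."
            self._zones[zone] = Zone(zone, "cleared", dog)
            return f"{zone} cleared."

    def owner(self, zone):
        return self._zones[zone].by
```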


def _read_json(self) -> dict[str, Any] | None:
    """Read a JSON object body. Returns None on malformed/non-object input."""
    length = int(self.headers.get("Content-Length", "0"))

P1 Malformed Content-Length header causes unhandled ValueError

int(self.headers.get("Content-Length", "0")) raises ValueError when the header value is non-numeric (e.g. "" or "abc"). Since _read_json is called directly from do_POST with no surrounding try-except, the exception propagates to ThreadingMixIn's request thread handler, which closes the connection without sending any HTTP response. The server stays alive but the client gets a silent connection reset.
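
The header parsing can be isolated in a small helper so the handler never sees the exception. This is a sketch, with `parse_content_length` as a hypothetical name. Treating a negative value as zero is an added assumption beyond the reviewer's suggestion.

```python
def parse_content_length(headers) -> int:
    """Return the declared body length, or 0 when the header is missing or malformed."""
    raw = headers.get("Content-Length", "0")
    try:
        length = int(raw)
    except (TypeError, ValueError):
        return 0  # non-numeric, empty, or None header value
    # A negative length is never valid; treat it like an absent body.
    return max(length, 0)
```

The handler could then read exactly `parse_content_length(self.headers)` bytes and return an error response for an empty or unparsable body instead of dropping the connection.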

AP-per-dog + Tailscale topology for the live demo no existing doc covers
(README assumes shared-router/STA; live.py's LAN_IP only works on one LAN).

- LIVE_RUNBOOK.md: on-site page — night-before checklist, per-laptop env,
  bring-up order, stable/showcase demo paths, failure->fix table.
- preflight.py: per-laptop chain checker (dog/route/internet/tailscale/
  coordinator) with exact remediation per FAIL; pure stdlib.
- scripted_pack_run.py: can't-miss stable path — one process per laptop drives
  this dog's identity against the real coordinator (no-overlap + inheritance)
  and optionally moves the real dog via /cmd_vel teleop; no LLM/nav-RPC.
@leshy leshy added the hackaton label May 29, 2026